Emerging only at the end of 2019 and declared by WHO as a global pandemic in March of 2020, COVID-19 has wrought an unprecedented devastation on global population health. By 1 January 2021 83.2 million confirmed cases and 1.8 million deaths had been reported to WHO. And as of May 13 2021, there are 160.1 million confirmed COVID-19 cases and 3.3 million deaths that have been reported to WHO (an additional 76.9 million cases and 1.5 million deaths in 2021 so far).
However, the reported COVID-19 burden has given an incomplete picture of the true impact of covid. There are a number of reasons for this:
As a result of this, in many cases, deaths caused by COVID-19 have been incorrectly attributed to other causes.
The excess mortality measure is crucial for accurately quantifying impact. The WHO defines excess mortality as “the mortality above what would be expected based on the non-crisis mortality rate in the population of interest”. Excess mortality is thus the mortality that is attributable to the crisis conditions.
To consider the impact of COVID-19, the mortality rate for year 2020, week \(t\) and age \(x\), can be represented by \[m^o_{x, t}\] and is assumed to be a result of the direct effects of COVID-19 (deaths attributable to it) and the indirect knock-on effects on health systems and society. These indirect effects impact the number of deaths due to other natural as well as external causes. The hypothetical or “counterfactual” no-COVID-19 scenario has expected mortality rate \[m^e_{x, t}\] which is estimated as the average of the most recent death rates (prior to the pandemic) or the forecast of historical rates to 2020. Excess mortality, represented by \(\mu_{x, t}\), can thus be defined as the difference: \[ \mu_{x, t} = m^o_{x, t} - m^e_{x, t} \] evaluated in rate-space to account for population changes and estimated as an absolute rate difference or percentage change. This gives a more comprehensive picture of how a location has been affected by the disease.
There is a data gap to overcome. Routine mortality data is usually received on an annual basis after the year of death or following an even longer lag. Civil registration and vital statistics (CRVS) systems differ across countries with varying timeline and quality control measures for compiling unit record cause-of-death numbers into aggregates indentified by cause, age, sex, place and period of death. In addition, differential reporting capacity and variable data quality across countries has resulted in many nations lacking the systems to provide good quality routine data in the past. Correspondingly, they also lack the capacity and data required to monitor all-cause mortality during this unprecedented pandemic. This results in a number of countries being unable to contribute to a centralized systematic mortality surveillance that would be needed to measure global, regional and country level excess mortality by the WHO.
To track excess mortality globally requires the development of tools and strategies for capturing all-cause mortality that account for the current state of irregular mortality tracking as well as imperfect country capacity. This will be achieved using three complementary exercises:
Over the next few sections, more detail is provided on the preliminary statistical modeling exercises.
The code below lists the data sources used as input for the statistical analysis and also shows the wrangling of the data to produce the main input file. URL links for all data files are provided in the code to facilitate reproducibility; and a download link for the final analysis data set is found below the code chunk.
#####################################################################################################################################
# House keeping
rm(list = ls(all = TRUE))
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
pacman::p_load(data.table, lubridate, tidyr, dplyr, countrycode)
#####################################################################################################################################
# Additional files for analysis
#####################################################################################################################################
# 1) Codes for mapping iso2 to iso3 for WHO public data
# 2) Total population numbers by country for 2020 (WPP2019 for 194 Member States)
# 3) GHE data by broad cause group, age, sex and year 2000 to 2019 (183/194 member states)
# 4) GBD data by broad cause group, age, sex and year 2000 to 2019 (11/194 member states)
ccodes <- fread("https://www.dropbox.com/s/hoefrsvbk3lz389/ccodes.csv?dl=1")
population <- fread("https://www.dropbox.com/s/nfmbu67o03kw8o4/population.csv?dl=1")
gheg3 <- fread("https://www.dropbox.com/s/cigt1ahro12qers/group3causes_ghe_183.csv?dl=1") %>%
mutate(sex = factor(sex, labels = c("Male", "Female")),
causename = case_when(causename == "All Causes" ~ "All",
causename == "Communicable, maternal, perinatal and nutritional conditions" ~ "Communicable",
causename == "Noncommunicable diseases" ~ "Noncommunicable",
TRUE ~ as.character(causename))) %>%
select(iso3, year, age, sex, causename, dths)
gbdg3 <- fread("https://www.dropbox.com/s/5hpa957mbzk9mrs/group3causes_gbd_11.csv?dl=1") %>%
rename(agel = age, causename = cause, dths = val) %>%
left_join(data.table(iso3 = c("AND","COK","DMA","MHL","MCO","NRU","NIU","PLW","KNA","SMR","TUV"),
location = c("Andorra", "Cook Islands", "Dominica", "Marshall Islands","Monaco",
"Nauru", "Niue", "Palau", "Saint Kitts and Nevis", "San Marino", "Tuvalu")), by = "location") %>%
left_join(data.table(age = seq(0,85,5), agel = c("Under 5", paste(seq(5,80,5),"to",seq(5,80,5)+4),"85 plus")), by="agel") %>%
mutate(sex = factor(sex, levels = c("Male", "Female")),
causename = case_when(causename == "All causes" ~ "All",
causename == "Communicable, maternal, neonatal, and nutritional diseases" ~ "Communicable",
causename == "Non-communicable diseases" ~ "Noncommunicable",
TRUE ~ as.character(causename))) %>%
select(iso3, year, age, sex, causename, dths)
# Combine the GHE and GBD based broad cause numbers
group3_194 <- rbind(gheg3, gbdg3) %>%
group_by(iso3, year, causename) %>% summarise(dths = sum(dths), .groups = "drop") %>% ungroup() %>%
spread(causename, dths) %>% arrange(iso3, year) %>% right_join(population, by = c("iso3","year"))
# Use the 2019 rates for NCDs, Communicable diseases and injuries as potential covariates
covariate.ghe<- group3_194 %>%
filter(year == 2019) %>% select(-c(year, All)) %>%
gather(cause, val, -iso3, -pop) %>% mutate(val=1e5*val/pop) %>% spread(cause, val) %>% mutate(pop=NULL)
#####################################################################################################################################
# Time-series random walk order 2 for forecast of expected deaths in 2020
#####################################################################################################################################
# Function to forecast the GHE all cause deaths for a single year
expected.ghe <- function(){
set.seed(12345)
iso.list <- sort(unique(group3_194$iso3)) # vector of iso3 to loop through;
ex.list <- list() # empty list to store expected values in;
for (i in 1:194){
print(paste0(round(100*i/194,1), "%"))
group3.subset <- group3_194 %>% filter(iso3 == iso.list[i]) %>%
arrange(year) %>% select(iso3, year, All) %>% mutate(lall = log(All))
fit <- inla(lall ~ f(year, model = "rw2"), data = group3.subset, control.predictor= list(compute=TRUE))
fit.out <- data.table(expected = exp(fit$summary.fitted.values$mean),
fit.l = exp(fit$summary.fitted.values$"0.025quant"),
fit.u = exp(fit$summary.fitted.values$"0.975quant"))
ex.list[[i]]<- cbind(group3.subset %>% mutate(lall = NULL), fit.out)
}
rbindlist(ex.list)
}
# takes a few min to run the function
# expected.out <- expected.ghe()
# fwrite(expected.out, file = "Data/expected.out.csv", row.names = F)
# loading the saved output from run
expected.out <- fread("https://www.dropbox.com/s/yjr1s99s2cmdpdb/expected.out.csv?dl=1")
ghe.expected <- expected.out %>%
filter(year == 2020) %>% select(iso3, expected) %>%
mutate(expected = round(0.25*expected), observed = NA)
ghe.expected <- rbind(ghe.expected %>% mutate(quarter = "q1"),
ghe.expected %>% mutate(quarter = "q2"),
ghe.expected %>% mutate(quarter = "q3"),
ghe.expected %>% mutate(quarter = "q4")) %>% arrange(iso3, quarter)
#####################################################################################################################################
# WHO COVID case and death data by date and country
#####################################################################################################################################
# Read in the WHO data and clean up some of the names
who_df <- fread("https://covid19.who.int/WHO-COVID-19-global-data.csv") %>%
mutate(Country_code = ifelse(Country == "Namibia", "NA", Country_code)) %>%
left_join(ccodes, by = "Country_code") %>%
mutate(Country = case_when(Country == "Curaçao" ~ "Curacao",
Country == "Saint Barthélemy" ~ "Saint Barthelemy",
Country == "Réunion" ~ "Reunion",
Country == "Kosovo[1]" ~ "Kosovo",
Country %in% c("Bonaire","Saba","Sint Eustatius") ~ "Bonaire, Sint Eustatius and Saba",
TRUE ~ as.character(Country)),
iso3 = case_when(Country == "Kosovo" ~ "XKX",
Country == "Saint Martin"~ "MAF",
Country == "Sint Maarten"~ "SXM",
Country == "Bonaire, Sint Eustatius and Saba" ~ "BES",
Country == "Curacao" ~ "CUW",
Country == "Saint Barthelemy" ~ "BLM",
Country == "South Sudan" ~ "SSD",
Country == "Namibia" ~ "NAM",
TRUE ~ as.character(iso3)),
date = as.Date(Date_reported,"%y-%m-%d")) %>%
rename(cov_cases = Cumulative_cases, cov_deaths = Cumulative_deaths)
# Subset to the 194 member states and aggregate cases and deaths by quarter of death
who_2020_194 <- who_df %>%
filter(date =="2020-04-01" | date =="2020-07-01"| date =="2020-10-01" | date =="2021-01-01") %>%
mutate(period = case_when(date == "2020-04-01" ~ "p1",
date == "2020-07-01" ~ "p2",
date == "2020-10-01" ~ "p3",
date == "2021-01-01" ~ "p4")) %>%
select(Country, iso3, WHO_region, period, cov_cases, cov_deaths) %>%
gather(source, val, -Country, -iso3, -WHO_region, -period) %>%
group_by(Country, iso3, WHO_region, period, source) %>%
summarise(val = sum(val, na.rm = T), .groups = "drop") %>% ungroup() %>%
spread(period, val) %>%
mutate(q1 = p1, q2 = p2 - p1, q3 = p3 - p2, q4 = p4 - p3) %>%
select(-c(p1, p2, p3, p4)) %>%
gather(quarter, val, -Country, -iso3, -WHO_region, -source) %>%
spread(source, val) %>%
right_join(population %>% filter(year == 2020), by = "iso3") %>%
mutate(WHO_region = factor(WHO_region,
levels = c("AFRO","AMRO","EMRO","EURO","SEARO","WPRO"))) %>%
arrange(iso3, quarter) %>%
group_by(iso3) %>%
mutate(cc_all = 1e5*sum(cov_cases)/pop, cd_all = 1e5*sum(cov_deaths)/pop) %>% ungroup()
#####################################################################################################################################
# SDI data from GBD and Income group from World Bank
#####################################################################################################################################
# SDI, filter for country locations (leave out region or subnational; & Georgia in the USA)
sdi.loc <- fread("https://www.dropbox.com/s/vhajcoyzw08wb8l/sdi.country.csv?dl=1")
temp <- tempfile(fileext = ".xlsx")
dataURL <- "http://ghdx.healthdata.org/sites/default/files/record-attached-files/IHME_GBD_2019_SDI_1990_2019_Y2020M10D15.XLSX"
download.file(dataURL, destfile=temp, mode='wb')
sdi.gbd <- readxl::read_excel(temp, skip = 1)[-c(105), ] %>%
filter(Location %in% sdi.loc$Location & Location != "Virgin Islands" ) %>%
mutate(iso3 = countrycode(Location, "country.name", "iso3c")) %>%
gather(year, sdi, -Location, -iso3) %>%
mutate(sdi=as.numeric(gsub( "·", ".",sdi)), year = as.numeric(year), Location = NULL) %>%
filter(year == 2019 & iso3 %in% unique(population$iso3)) %>%
select(-c(year))
# Income group from World Bank, assume Cook Islands and Niue are upper middle income like Tonga and Tuvalu
temp <- tempfile(fileext = ".xls")
dataURL <- "http://databank.worldbank.org/data/download/site-content/CLASS.xls"
download.file(dataURL, destfile=temp, mode='wb')
wbincome <- readxl::read_excel(temp, skip = 4) %>%
filter(Code != "x") %>%
select(Code, `Income group`) %>%
rename(iso3 = Code, group = "Income group") %>%
filter(iso3 %in% unique(population$iso3))
wb_oth <- wbincome %>%
filter(iso3 %in% c("TON","TUV")) %>% mutate(iso3 = c("COK", "NIU"))
wbincome <- wbincome %>% rbind(wb_oth)
#####################################################################################################################################
# Observed compiled data from Karlinsky et al and multiple sources
#####################################################################################################################################
# The baseline are forcasted by time using the linear regression function
# Will compare the aggregate for Ariels forecasts to the aggregate GHE forecasts
baseline_data <- fread("https://raw.github.com/dkobak/excess-mortality/main/baselines.csv") %>%
rename(country = V1, time = V2, expected = V3) %>%
filter(!(country %in% c("Kosovo", "Transnistria"))) %>%
mutate(iso3 = countrycode(country, "country.name", "iso3c"), country = NULL)
# Pull the collated observed data from World Mortality Data
# This step is replicated using HMD, EUROMOMO etc when doing age-disaggregated analysis
observed.df <- fread("https://raw.github.com/akarlinsky/world_mortality/main/world_mortality.csv", header = T) %>%
filter(!(country_name %in% c("Kosovo", "Transnistria",
"Uruguay", "Belarus", "Nicaragua","El Salvador")) & year == 2020 &
time_unit != "quarterly (Solar Hijri)") %>%
mutate(iso3 = countrycode(country_name, "country.name", "iso3c")) %>%
left_join(baseline_data, by = c("iso3","time")) %>%
mutate(quarter = ifelse((time_unit == "weekly" & time >= 1 & time <= 13)|
(time_unit == "monthly" & time >= 1 & time <= 3)|
(time_unit == "quarterly" & time == 1), "q1",
ifelse((time_unit == "weekly" & time >= 14 & time <= 26)|
(time_unit == "monthly" & time >= 4 & time <= 6)|
(time_unit == "quarterly" & time == 2), "q2",
ifelse((time_unit == "weekly" & time >= 27 & time <= 39)|
(time_unit == "monthly" & time >= 7 & time <= 9)|
(time_unit == "quarterly" & time == 3), "q3", "q4")))) %>%
group_by(iso3, quarter) %>%
summarise(observed = sum(deaths, na.rm = T), expected = sum(expected, na.rm = T), .groups = "drop") %>% ungroup()
expected.df <- ghe.expected %>% filter(!(iso3 %in% observed.df$iso3)) %>% rbind(observed.df)
#####################################################################################################################################
# Oxford tracker data on government policy (https://www.nature.com/articles/s41562-021-01079-8)
# Use medians for the quarter to get summary statistic
# Infill missing countries with WHO region means
covid_pol <- fread("https://github.com/OxCGRT/covid-policy-tracker/raw/master/data/OxCGRT_latest.csv")
covid_pol_df <- covid_pol %>%
rename(iso3 = CountryCode) %>%
mutate(quarter = ifelse(Date < 20200401, "q1", ifelse(Date < 20200701, "q2", ifelse(Date < 20201001, "q3", "q4")))) %>%
group_by(iso3, quarter) %>%
summarize(Stringency = median(StringencyIndex, na.rm = T),
Government = median(GovernmentResponseIndex, na.rm = T),
Containment = median(ContainmentHealthIndex, na.rm = T),
Economic = mean(EconomicSupportIndex, na.rm = T), .groups = "drop") %>%
ungroup() %>% filter(!is.na(Stringency)) %>%
right_join(who_2020_194 %>% select(iso3, WHO_region, quarter), by = c("iso3", "quarter")) %>%
group_by(WHO_region, quarter) %>%
mutate(Stringency = ifelse(is.na(Stringency), mean(Stringency, na.rm = T), Stringency),
Government = ifelse(is.na(Government), mean(Government, na.rm = T), Government),
Containment = ifelse(is.na(Containment), mean(Containment, na.rm = T), Containment),
Economic = ifelse(is.na(Economic), mean(Economic, na.rm = T), Economic)) %>%
ungroup() %>% mutate(WHO_region = NULL)
#####################################################################################################################################
# Dataset to use in preliminary statistical models
#####################################################################################################################################
# Now bring all the data together
# Create some rate variables from counts as well as interactions
analysis_mod <- who_2020_194 %>%
left_join(expected.df, by = c("iso3","quarter")) %>%
left_join(covid_pol_df, by = c("iso3","quarter")) %>%
left_join(covariate.ghe, by = "iso3") %>%
left_join(sdi.gbd, by = "iso3") %>%
left_join(wbincome, by = "iso3") %>%
arrange(iso3, quarter) %>%
rename(cmr = Communicable, injr = Injuries, ncdr = Noncommunicable) %>%
mutate(group = factor(ifelse(group == "High income", "HIC",
ifelse(group == "Upper middle income", "UMIC", "LMIC")),
levels = c("LMIC", "UMIC", "HIC")),
cfr = ifelse(cov_cases == 0, 0, cov_deaths/cov_cases), # deaths per case
expcov = expected + cov_deaths, # sum of expected and covid
excess = round(observed - expected), # excess deaths
ratio = observed/expcov, # ratio of observed to sum of expected and covid
Nx = pop*1e-5, # per 100,000 population
expcovr = expcov/Nx, # expected plus covid rate
excessr = excess/Nx, # observed - covid rate
expectedr = expected/Nx, # expected rate
covidr = cov_deaths/Nx, # covid rate
casesr = cov_cases/Nx, # cases rate
covsdi = covidr * sdi, # interaction covid and SDI
ncdsdi = ncdr * sdi, # interaction NCDs and SDI
quarter = as.factor(quarter), # factor of quarter
lratio = log(ratio), # natural log of ratio
sdrat = sd(lratio, na.rm = T), # standard deviation of log ratio
mrat = mean(lratio, na.rm = T), # mean of log ratio
zratio = (lratio - mrat)/sdrat, # standardized log ratio
class = factor(cut(right = FALSE, x = covidr, # direct COVID impact burden broad gategory
breaks = c(0, 1, 10, 20, 40, 70, 1e5)),
labels = paste(c(0, 1, 10, 20, 40, 70))),
index = ifelse(is.na(excess), 0, 1), # index on observed vs missing
row = 1:nrow(who_2020_194)) %>% # row index
select(Country, iso3, WHO_region, group, index, pop, Nx, quarter,
observed, expected, ratio, excess, excessr, expectedr, class,
cov_deaths, covidr, cov_cases, casesr, expcov, expcovr, cc_all, cd_all,
sdi, covsdi, ncdr, ncdsdi, cmr, cfr, sdrat, mrat, zratio,
Stringency, Government, Containment, Economic, row)
save(analysis_mod, file = "Data/analysis_mod.Rda")The analysis data used for the models can be downloaded as an R-database using the analysis_mod.RDa link below:
One of the key inputs in the estimation of excess mortality is the baseline or counterfactual i.e., the expected deaths in the absence of COVID-19. As shown the in data input, one way that these can be determined for the aggregate of the year is through a one-year step forecast. THis is performed using the GHE data for the historical trend. The forecast assumes a second order random walk (RW2) model which has the density \[\pi(y) \propto exp\left(-\frac{1}{2}\sum_{i=3}^n (y_i - 2y_{i-1} + y_{i-2})^2\right)\] where the term \(y_{i+1} −2y_i +y_{i−1}\) can be interpreted as an estimate of the second order derivative of a continuous time function \(y(t)\) at \(t = i\) using values of \(y(t)\) at \(t = i − 1\), \(i\), and \(i+1\). The plots below show the historical mean values from the GHE as well as the crresponding forecasted estimates and uncertainty. (Note only the mean is used as input). In addition, where available, the expected values for 2020 according to World Mortality Data are also shown.
The first option to predicting excess mortality is to build a log-linear model that can be applied according to a Generalized linear models (GLMs) framework. A GLM has the structure
\[g\big(\mu_i\big) = \boldsymbol{x}_i \boldsymbol{\beta}\]
where \(\mu_i\) is the expectation of response variable \(Y_i\) such that \(\mu_i \equiv \mathbb{E}\big(Y_i\big)\), \(g\) is a link function, \(\boldsymbol{x}_i\) is the \(i\)th row of the covariate matrix \(\boldsymbol{X}\), and \(\boldsymbol{\beta}\) is a vector of unknown coefficients. Akin to linear models, GLMs center around a linear predictor, \(\boldsymbol{X}\boldsymbol{\beta}\), and allow for link functions other than the identity function. GLMs also allow for any distribution from the exponential family to model the dependent variable instead of just assuming it Gaussian. We predict the log of the ratio
\[\mu_i = log(s) = log\left(\frac{Observed}{Expected + covid19}\right) = \boldsymbol{x}_i \boldsymbol{\beta}\] for \(x_i \in \{SDI, NCD, COVID\}\) where \(COVID\) represents the reported COVID mortality rate, \(NCD\) are the NCD deaths rates as described in data input as well as the interaction of the SDI and COVID. The code used to predict the scaling factor as well as the estimation of the uncertainty is provided below. Results are in a tab to follow.
#####################################################################################################################################
# Linear regression model using GLM instead of LM incase we change assumption about family
#####################################################################################################################################
rm(list = ls())
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
#####################################################################################################################################
# Load packages for all statistical analyses
pacman::p_load(highcharter, data.table, tidyr, dplyr, INLA, xgboost, mlr, countrycode)
# Can load the main data file again
load("Data/analysis_mod.Rda")
#####################################################################################################################################
# Aggregating from quarters to the sum for the entire year
df.lin <- analysis_mod %>%
dplyr::select(Country, iso3, WHO_region, cov_deaths, expected, observed, Nx, ncdr, sdi) %>%
group_by(Country, iso3, WHO_region, Nx, ncdr, sdi) %>%
summarize(covid = sum(cov_deaths), expected = sum(expected),
observed = sum(observed), .groups = "drop") %>%
ungroup() %>%
mutate(excess = observed - expected, expectedc = expected + covid,
ratio = observed/expectedc, covidr = covid/Nx, covsdi = covidr * sdi) %>%
arrange(iso3)
#####################################################################################################################################
# Indices to identify rows
df.lin <- df.lin %>% mutate(row = 1:nrow(df.lin))
ind.obsl <- df.lin %>% filter(!is.na(excess)) %>% pull(row)
#####################################################################################################################################
# Approach
# 1) Run basic linear regression model
# 2) Sample 200 sets of coefficients using distribution given by inverse of variance matrix
# 3) Generate 200 realizations of fitted scaling factors for each country
# 4) Replace scaling factor with observed for places with data
# 5) Apply scaling factors to observed covid + expected to get predicted actual
# 6) Excess is predicted actual - expected
# 7) Use samples to summarise information by country but also for regional summaries
reg.mod <- glm(log(ratio) ~ . - 1, data = df.lin %>% dplyr::select(ratio, sdi, covidr, covsdi, ncdr))
pe <- reg.mod$coefficients # point estimates
vc <- vcov(reg.mod) # variance-covariance
set.seed(1234)
draws <- 200
# Distribution of coefficients
simbetas <- MASS::mvrnorm(draws, pe, vc)
# Input data to scale the coefficients
df.lin.res <- df.lin %>% dplyr::select(dimnames(simbetas)[[2]]) %>% as.matrix()
# Distribution of fitted values (these are scaling factors from the ratio of observed to covid plus expected)
replic <- exp(df.lin.res %*% t(simbetas))
# Use the scaling factors to obtain distribution of excess deaths
fitdist1e <- replic*replicate(draws, df.lin$expectedc) - replicate(draws, df.lin$expected)
obsvals <- df.lin$excess # vector of observed values
# For the non-missing, replace predictions with the actual observed
for (i in 1:draws){
fitdist1e[ind.obsl,] <- obsvals[ind.obsl]
}The syntax used to generate country, regional and global uncertainty bounds can be shown through the link below.
#####################################################################################################################################
# Basic functions to summarise 80% bootstrap intervals or sample from gaussian
#####################################################################################################################################
getsum <- function(x){
tibble(mean = mean(x, na.rm = T),
lwr = quantile(x, 0.1, na.rm = T),
uppr = quantile(x, 0.9, na.rm = T)) %>% round()
}
getsum2 <- function(x){
c(mean(x, na.rm = T), quantile(x, 0.1, na.rm = T), quantile(x, 0.9, na.rm = T))
}
get.sample <- function(x){
rnorm(draws, x[1], x[2])
}
#####################################################################################################################################
# Table of summarized values joined to observed and explanator data
df.lin.out <- data.table::data.table(t(apply(fitdist1e, 1, getsum2))) %>%
rename(y = "V1", low = "10%", high = "90%") %>%
cbind(df.lin) %>%
mutate(y = ifelse(!is.na(excess), excess, y),
low = ifelse(!is.na(excess), NA, low),
high = ifelse(!is.na(excess), NA, high),
source = ifelse(is.na(excess), "Predicted", "Observed")) %>% arrange(-y) %>%
data.frame()
#####################################################################################################################################
# Indices by region
afro.indl <- df.lin %>% filter(WHO_region == "AFRO") %>% pull(row)
emro.indl <- df.lin %>% filter(WHO_region == "EMRO") %>% pull(row)
euro.indl <- df.lin %>% filter(WHO_region == "EURO") %>% pull(row)
paho.indl <- df.lin %>% filter(WHO_region == "AMRO") %>% pull(row)
searo.indl <- df.lin %>% filter(WHO_region == "SEARO") %>% pull(row)
wpro.indl <- df.lin %>% filter(WHO_region == "WPRO") %>% pull(row)
glob.indl <- df.lin %>% pull(row)
#####################################################################################################################################
afro.sum <- getsum(apply(fitdist1e[afro.indl,], 2, sum)) %>%
mutate(region = "AFRO", covid = sum(df.lin[afro.indl,]$covid), Nx = sum(df.lin[afro.indl,]$Nx))
emro.sum <- getsum(apply(fitdist1e[emro.indl,], 2, sum)) %>%
mutate(region = "EMRO", covid = sum(df.lin[emro.indl,]$covid), Nx = sum(df.lin[emro.indl,]$Nx))
euro.sum <- getsum(apply(fitdist1e[euro.indl,], 2, sum)) %>%
mutate(region = "EURO", covid = sum(df.lin[euro.indl,]$covid), Nx = sum(df.lin[euro.indl,]$Nx))
paho.sum <- getsum(apply(fitdist1e[paho.indl,], 2, sum)) %>%
mutate(region = "PAHO", covid = sum(df.lin[paho.indl,]$covid), Nx = sum(df.lin[paho.indl,]$Nx))
searo.sum <- getsum(apply(fitdist1e[searo.indl,], 2, sum)) %>%
mutate(region = "SEARO", covid = sum(df.lin[searo.indl,]$covid), Nx = sum(df.lin[searo.indl,]$Nx))
wpro.sum <- getsum(apply(fitdist1e[wpro.indl,], 2, sum)) %>%
mutate(region = "WPRO", covid = sum(df.lin[wpro.indl,]$covid), Nx = sum(df.lin[wpro.indl,]$Nx))
global.sum <- getsum(apply(fitdist1e[glob.indl,], 2, sum)) %>%
mutate(region = "Global", covid = sum(df.lin[glob.indl,]$covid), Nx = sum(df.lin[glob.indl,]$Nx))
#####################################################################################################################################
sum.exc.df <- rbind(afro.sum, emro.sum, euro.sum,
paho.sum, searo.sum, wpro.sum, global.sum) %>%
rename(low = lwr, high = uppr) %>%
mutate(excess = round(mean - covid),
mean = round(mean), low = round(low), high = round(high),
region = factor(region, levels = c("AFRO", "EMRO", "EURO", "PAHO", "SEARO", "WPRO", "Global")),
model = "Basic Linear") %>%
dplyr::select(model, region, Nx, covid, excess, low, mean, high)A second option is to assume a linear mixed effects model to predict the excess mortality rate. In a linear mixed effects model we let \(y_j = (y_{j1}; \ldots ; y_{jn})\) denote the vector of repeated measurements for a group \(j\). We reconsider the general model \[ y_j = X_j\beta + Zjbj + \varepsilon_j\] in which \(\beta\) is a vector of population-average regression coefficients called fixed effects, and where \(b_j\) is a vector of group-specific regression coefficients. The \(b_j\) describe how the evolution of the \(j\)th group deviates from the average. The matrices \(X_j\) and \(Z_j\) are matrices of known covariates. The residual components \(\varepsilon_j\) are assumed to be independent \(N(0, \Sigma_j)\), where \(\Sigma_j\) depends on \(j\).
In this case the variable of interest is the excess mortality rate i.e., the difference between the observed and expected relative to the overall population per 100,000. As shown in the data input tab, the data are disaggregated by quarter of deaths. For fixed effects we can take the overall covid death rates for the year, the expected death rates, the sdi and quarter specific covid, NCD and communicable disease death rates as well as the quarter. Random effects assumed for the region and income group. The model, is shown in the code below with results to follow in the final tab.
#####################################################################################################################################
# Mixed effects model in INLA, similar syntax to GLM
#####################################################################################################################################
df.mixed <- inla(excessr ~ cd_all + expectedr + sdi * covidr + ncdr + cmr + quarter - 1 +
f(WHO_region, model = "iid") + f(group, model = "iid"),
data = analysis_mod,
control.predictor= list(compute=TRUE))
pred.in <- cbind(df.mixed$summary.fitted.values$mean, df.mixed$summary.fitted.values$sd)
fit.dist3 <- t(apply(pred.in, 1, get.sample)) # Assume distribution of fit is Normal with mean and sd
ind.obs <- analysis_mod %>% filter(!is.na(excess)) %>% pull(row) # indices for no data
#####################################################################################################################################
# For the observed replace predictions with actual excess rate estimates
for (i in 1:draws){
fit.dist3[ind.obs,] <- analysis_mod$excessr[ind.obs]
}
fit.dist3e <- analysis_mod$Nx * fit.dist3
#####################################################################################################################################
# Results are by quarter, need to derive draws by country and then use those to derive draws by region etc
fit.dist3ef1 <- array(dim = c(194, draws))
for (j in 1:194){
fit.dist3ef1[j,] <- apply(fit.dist3e[(j*4 - 3):(j*4),], 2, sum)
}
#####################################################################################################################################
# Summarize with bootstrap confidence intervals based on the distribution of the draws
df.mixed.out <- data.table::data.table(t(apply(fit.dist3ef1, 1, getsum2))) %>%
rename(y = "V1", low = "10%", high = "90%") %>%
cbind(df.lin) %>% filter(!is.na(y)) %>%
mutate(y = ifelse(!is.na(excess), excess, y),
low = ifelse(!is.na(excess), NA, low),
high = ifelse(!is.na(excess), NA, high),
source = ifelse(is.na(excess), "Predicted", "Observed")) %>% arrange(-y)The code for aggregating outputs to generate region specific uncertainty can be shown using the link below.
#####################################################################################################################################
# Identify the rows for each region to aggregate the matrices accordingly
afro.ind <- analysis_mod %>% filter(WHO_region == "AFRO") %>% pull(row)
emro.ind <- analysis_mod %>% filter(WHO_region == "EMRO") %>% pull(row)
euro.ind <- analysis_mod %>% filter(WHO_region == "EURO") %>% pull(row)
paho.ind <- analysis_mod %>% filter(WHO_region == "AMRO") %>% pull(row)
searo.ind <- analysis_mod %>% filter(WHO_region == "SEARO") %>% pull(row)
wpro.ind <- analysis_mod %>% filter(WHO_region == "WPRO") %>% pull(row)
glob.ind <- analysis_mod %>% pull(row)
#####################################################################################################################################
# Summarise within each aggregate, first by column (draw) and then across draws by aggregate
afro.sum.exc <- getsum(apply(fit.dist3e[afro.ind,], 2, sum)) %>%
mutate(region = "AFRO", covid = sum(analysis_mod[afro.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[afro.ind,]$Nx))
emro.sum.exc <- getsum(apply(fit.dist3e[emro.ind,], 2, sum)) %>%
mutate(region = "EMRO", covid = sum(analysis_mod[emro.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[emro.ind,]$Nx))
euro.sum.exc <- getsum(apply(fit.dist3e[euro.ind,], 2, sum)) %>%
mutate(region = "EURO", covid = sum(analysis_mod[euro.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[euro.ind,]$Nx))
paho.sum.exc <- getsum(apply(fit.dist3e[paho.ind,], 2, sum)) %>%
mutate(region = "PAHO", covid = sum(analysis_mod[paho.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[paho.ind,]$Nx))
searo.sum.exc <- getsum(apply(fit.dist3e[searo.ind,], 2, sum)) %>%
mutate(region = "SEARO", covid = sum(analysis_mod[searo.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[searo.ind,]$Nx))
wpro.sum.exc <- getsum(apply(fit.dist3e[wpro.ind,], 2, sum)) %>%
mutate(region = "WPRO", covid = sum(analysis_mod[wpro.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[wpro.ind,]$Nx))
global.sum.exc <- getsum(apply(fit.dist3e[glob.ind,], 2, sum)) %>%
mutate(region = "Global", covid = sum(analysis_mod[glob.ind,]$cov_deaths), Nx = .25*sum(analysis_mod[glob.ind,]$Nx))
#####################################################################################################################################
sum.exc.exc <- rbind(afro.sum.exc, emro.sum.exc, euro.sum.exc,
paho.sum.exc, searo.sum.exc, wpro.sum.exc, global.sum.exc) %>%
rename(low = lwr, high = uppr) %>%
mutate(excess = round(mean - covid),
mean = round(mean), low = round(low), high = round(high),
region = factor(region, levels = c("AFRO", "EMRO", "EURO", "PAHO", "SEARO", "WPRO", "Global")),
model = "Mixed Effects") %>%
dplyr::select(model, region, Nx, covid, excess, low, mean, high)For a country \(c\) with total mortality \(Y_c\) for 2020 we will report \[Y_c-E_c,\] where \(E_c\) is the expected deaths, based on a time series model applied to previous year’s data. For a country \(c\) without totaly mortality we instead use \[\widehat{Y}_c - E_c,\] where \(\widehat{Y}_c\) is the fitted value from a model based on covariates \(X_c\). In the next section we discuss models.
Suppose we have the recorded number of deaths \(Y_c\) in \(2020\) and covariates \(X_c\) for each of \(C\) countries in the AMRO and EURO regions. We will explore \(3\) negative-binomial models for \(Y_c\), which take the form \[Y_c\sim \textsf{Negative-Binomial}(\mu_c, \phi),\] where \(E[Y_c]=\mu_c\) and \(\phi\) is an overdisperison parameter such that \(Var(Y_C)=\mu_c+\mu_c^2/\phi\):
We will then use the fitted models to predict the number of excess deaths for countries without recorded number of deaths in \(2020\).
Before fitting the models, we explore the relationships between the covariates and the responses. First we plot \(\log(Y/N)\) vs. each of the covariates, and overlay a loess smoother:
Next we plot \(\log(Y/E)\) vs. each of the covariates, and overlay a loess smoother:
Finally we plot \(\log((Y-E)/N)\) vs. each of the covariates, and overlay a loess smoother:
We fit each of the \(3\) candidate models in Stan using \(N(0, 5^2)\) priors on regression coefficients and \(N^+(0, 100^2)\) (i.e. a half-normal) priors on the overdispersion parameter (this prior should be explored more). Covariates were centered and scaled to have mean \(0\) and standard deviation \(1\).
We will first examine these fits through residual analyses. In particular for each model, for each of the \(C\) countries we can look at the posterior mean for \(\mu_c\) vs. the observed \(Y_c\), the posterior mean for \(Y_c-\mu_c\) vs. the posterior mean for \(\mu_c\), and the posterior mean for \((Y_c-\mu_c)/\sqrt{\mu_c+\mu_c^2/\phi}\) vs. the posterior mean for \(\mu_c\).
We can additionally perform a leave one out cross validation residual analysis, where we fit each of the models \(C\) times, each time leaving out one of the countries, and then look at the residuals for the left out countries.
In addition to looking at residuals, using leave one out cross validation we can compute the expected log-predictive density (elpd), also known as the log CPO: \[ \sum_{c=1}^C\log\left[\int p(Y_c\mid \theta)p(\theta\mid Y_{-c})d\theta\right] \] where \(\theta\) are all model parameters, \(p(Y_c\mid \theta)\) is the predictive density for country \(c\), and \(p(\theta\mid Y_{-c})\) is the posterior conditional on all of the countries’ data except for country \(c\).
## [1] "Model 1: Offset N"
## Estimate SE
## elpd -685.375 15.40145
## ic 1370.750 30.80290
## [1] "Model 2: Offset E"
## Estimate SE
## elpd -655.3858 16.45395
## ic 1310.7716 32.90789
## [1] "Model 3: Additive"
## Estimate SE
## elpd -651.2184 16.01878
## ic 1302.4369 32.03756
Higher elpd is better (or equivalently lower ic, which is just -2*elpd), and so we see that models 2 and 3 have the best fits according to elpd.
For a given country \(i\) for which we do not have the number of recorded deaths \(Y_i\) in \(2020\), we can predict \(Y_i\) using the fitted \(\mu_i\) under each of the models. For each of the countries in the AMRO and EURO regions which did not have the number of recorded deaths in \(2020\), we now plot the posterior median and \(95\%\) credible intervals for the number of excess deaths, \(\mu_i-E_i\). For the \(C\) countries for which we had the number of recorded deaths in \(2020\), we plot the observed \(Y_c-E_c\)
Here’s the same plot but slightly zoomed in:
We can also see how these excess death estimates compare to the number of reported COVID19 deaths in 2020, for the countries for which we’re making predictions.
Further we can get region level predictions of the number of excess deaths:
## [1] "Model 1: Offset N"
## region est lower upper
## 1 AMRO 1196879 1142166 1257225
## 2 EURO 1099435 1042104 1162357
## [1] "Model 1: Offset E"
## region est lower upper
## 1 AMRO 1418871 1373582 1467314
## 2 EURO 1138242 1096769 1184846
## [1] "Model 1: Additive"
## region est lower upper
## 1 AMRO 1411024 1364107 1471211
## 2 EURO 1143649 1117139 1209144
Overall for the year 2020:
Global: 1.81 million COVID-19 and 3.31 million estimated excess deaths (UI 2.6-3.98)
AFRO: 43 thousand COVID-19 and 106 thousand estimated excess deaths (UI -12-217)
EMRO: 120 thousand COVID-19 and 287 thousand estimated excess deaths (UI 227-346)
EURO: 586 thousand COVID-19 and 1.13 million estimated excess deaths (UI 1.12-1.14)
PAHO: 860 thousand COVID-19 and 1.32 million estimated excess deaths (UI 1.3-1.34)
SEARO: 184 thousand COVID-19 and 446 thousand estimated excess deaths (UI 194-687)
WPRO: 20 thousand COVID-19 and 24 thousand estimated excess deaths (UI -258-282)
Overall for the year 2020:
Global: 1.81 million COVID-19 and 3.31 million estimated excess deaths (UI 3.01-3.62)
AFRO: 43 thousand COVID-19 and 193 thousand estimated excess deaths (UI 123-262)
EMRO: 120 thousand COVID-19 and 385 thousand estimated excess deaths (UI 302-469)
EURO: 586 thousand COVID-19 and 1.11 million estimated excess deaths (UI 1.1-1.12)
PAHO: 860 thousand COVID-19 and 1.33 million estimated excess deaths (UI 1.32-1.34)
SEARO: 184 thousand COVID-19 and 501 thousand estimated excess deaths (UI 281-730)
WPRO: 20 thousand COVID-19 and -205 thousand estimated excess deaths (UI -387–19)